Single-cell allele-specific expression analysis reveals dynamic and cell-type-specific regulatory effects

Differential allele-specific expression (ASE) is a powerful tool to study context-specific cis-regulation of gene expression. Such effects can reflect the interaction between genetic or epigenetic factors and a measured context or condition. Single-cell RNA sequencing (scRNA-seq) allows the measurement of ASE at individual-cell resolution, but there is a lack of statistical methods to analyze such data. We present Differential Allelic Expression using Single-Cell data (DAESC), a powerful method for differential ASE analysis using scRNA-seq from multiple individuals, with statistical behavior confirmed through simulation. DAESC accounts for non-independence between cells from the same individual and incorporates implicit haplotype phasing. Application to data from 105 induced pluripotent stem cell (iPSC) lines identifies 657 genes dynamically regulated during endoderm differentiation, with enrichment for changes in chromatin state. Application to a type-2 diabetes dataset identifies several differentially regulated genes between patients and controls in pancreatic endocrine cells. DAESC is a powerful method for single-cell ASE analysis and can uncover novel insights on gene regulation.


Supplementary Notes
1 DAESC-BB model and inference

Model setup
For a gene or heterozygous transcribed SNP (tSNP), let y ij be the alternative allele read count of individual i and cell j, and n ij be the total allele-specific read count.Here i = 1, 2, ..., N and j = 1, 2, ..., J i .Let x ij be a length-p vector of independent variables.The DAESC-BB model is formulated as follows: The beta-binomial distribution for y ij is equivalent to y ij ⇠ binomial(n ij , p ij ), p ij ⇠ beta(µ ij / , (1 µ ij )/ ).The fixed e↵ects represents ASE and dynamic ASE e↵ects.The individual-specific random e↵ects a i 's capture the sample repeat structure due to having multiple cells per individual.
The complete data likelihood is Hence the complete-data log-likelihood is At iteration t, we approximate the conditional distribution P (a i | y i ; (t) , 2 a,(t) , (t) ) by N (â i,(t) , ˆ 2 ai,(t) ).See section 3 for derivation of the variational approximation.
Here (t) , 2 a,(t) and (t) are the current values at iteration t.The EM Q-function is expressed as The expectation does not have a closed-form solution.We use Gauss-Hermite quadrature to approximate it.Hence the Q-function can be approximated by Here (z m , w m ), m = 1, ..., M are the nodes and weights of a Gauss-Hermite quadrature for standard normal distribution.In practice, we found M = 3 nodes is su cient for approximating the Q function.

M-step
The update for 2 a has a closed form: The update for and is obtained by maximizing the Q function using Newton-Raphson.
2 DAESC-Mix model and inference

Model setup
DAESC-Mix is an extension of DAESC-BB incorporating a latent variable i for implicit haplotype phasing.
The model is formulated as follows: The variable i models the scenario where ASE is caused by one regulatory SNP (rSNP).When i = 1, the alternative allele of the eQTL and the alternative allele of the tSNP are on the same haplotype, and the reference alleles of the two SNPs are on the same haplotype.When i = 0, the alternative allele of the eQTL and the reference allele of the tSNP are on the same haplotype, and vice versa (Figure 1).

Parameter estimation by variational EM algorithm
We treat a i and i as missing data.The complete-data likelihood is The complete-data log-likelihood is At iteration t, we approximate the posterior distribution P (a i , i | y i ; (t) , 2 a,(t) , (t) , ⇡ 0,(t) ) using variational inference (Blei et al, 2017).Using the mean-field approximation q(a i , i ) = q(a i )q( i ), we update q( i ) and q(a i ) iteratively as follows.The update for q( i ) is The integrals are approximated by Gauss-Hermite quadrature (see Section 1.2.1 for details).The resulting distribution is a bernoulli distribution, denoted by Ber(⇡ i,(t) ).The variational update for a i is i This update has no closed form, but we approximate by a normal distribution N (â i,(t) , ˆ 2 ai,(t) ), as in Laplace variational inference (Wang and Blei, 2013).See section 3.2 for details.

M-step
• Update a by 2 a,(t+1) = 1 • Similar to section 1.2.2, update and by numerical maximization of E q( i ,ai) , 2 a , , ⇡ 0 )], which is the complete data log likelihood integrated over variation distribution q( i , a i ).The integration over q(a i ) is conducted numerically using Gaussian-Hermite quadrature.
3 Approximating posterior distribution P (a i | y i ) in the E-step

DAESC-BB
In the DAESC-BB model, the joint distribution of y i and a i is Define X i = (x i1 , ..., x iJi ) T , y i = (y i1 , ..., y iJi ) T and n i = (n i1 , ..., n iJi ) T .Define f , 2 a , ,Xi,y i ,ni (a i ) as the log joint distribution, i.e.
Here µ ij = (x T ij + a i ).To approximate P (a i | y i ; , 2 a , ), we derived the Taylor expansion of f , 2 a , ,Xi,y i ,ni (a i ).Denote by âi the value that maximizes f , 2 a , ,Xi,y i ,ni (a i ), we have Hence the posterior distribution P (a i | y i , , 2 a , ) can be approximated by N (â i , ˆ 2 ai ), where ˆ 2 ai = |f 00 , 2 a , ,Xi,y i ,ni (â i )| 1 .Now we derive the derivatives of f , 2 a , ,Xi,y i ,ni (a i ) and the Newton-Raphson algorithm to obtain âi .
Here (x) = d dx log (x) is the digamma function and 1 (x) = d 2 dx 2 log (x) is the trigamma function.To obtain âi , we use a modified Newton-Raphson method.At iteration k, we update a i as follows By default, we set ⌧ = 0.9.

DAESC-Mix
In the E-step for DAESC-Mix, the log variational distribution for a i is Here we dropped the subscript (t) for simpler notations.The derivatives can be computed as follows: Similar to section 3.1, we derive the maximum âi using Newton-Raphson updates and q(a i ) can be approximated by N (â i , ˆ 2 ai ), where ˆ 2 ai = |h 00 (â i )| 1 .We choose ⌧ = 0.9.

Simulate haplotype proportions
In the simulation studies, we vary the LD coe cient (r 2 ) between the eQTL and the tSNP.The simulation model, however, does not directly use the LD coe cient.Instead, it uses haplotype proportions determined by r 2 and the minor allele frequencies (MAF).
We start by introducing a few notations.Denote by g r1 the genotype of eQTL (regulatory SNP) in haplotype 1 and g r2 the genotype of eQTL in haplotype 2. Similarly define g e1 and g e2 as the genotypes of tSNP in haplotypes 1 and 2, respectively.Genotypes g r2 , g r2 , g e1 , g e2 can takes values 0 or 1. Denote by a r and a e the minor allele frequencies (MAF) of the eQTL and tSNP, respectively.We start by simulating a r from a uniform distribution: With a given r 2 , the possible values of a e are bounded by a r .We derive the bound by first deriving the relationship among r 2 , MAFs and haplotype frequencies.Note that Without loss of generality we assume r > 0. If r < 0 we can simply flip the reference and alternative alleles of one of the SNPs to ensure r > 0. Define the following haplotype frequencies: p 11 = P (g r1 = 1, g e1 = 1), p 10 = P (g r1 = 1, g e1 = 0) p 01 = P (g r1 = 0, g e1 = 1), p 00 = P (g r1 = 0, g e1 = 0) Hence p 11 = a r a e + r p a r a e (1 a r )(1 a e ).It needs to satisfy the restrictions p 11 < a r and p 11 < a e since the haplotype frequency cannot exceed corresponding allele frequencies of individual SNPs.This is equivalent to 1 a e .
Hence we can derive the bounds for a e : We simulate a e by uniform distribution: a e ⇠ U [ r 2 ar 1 ar+r 2 ar , These are the proportions of individuals for which the eQTL is heterozygous (⇡ 1 , ⇡2 ) or homozygous (⇡ 3 ) in the general population, regardless of whether the tSNP is heterozygous.However, we need to restrict to the individuals for which the tSNP is heterozygous, since ASE cannot be measured for homozygous individuals.
Hence we normalize the probabilities to get the final mixture probabilities: